fix(gateway): drive fault transport with its own executor#402
Open
bburda wants to merge 2 commits into
Open
Conversation
Ros2FaultServiceTransport::get_snapshots / get_rosbag (and the other RPC methods) previously called rclcpp::Client::async_send_request and blocked on the std::future. That required some external executor to be spinning the host node, and TSan caught a race between the spin thread cleaning up the client's pending request entry (execute_client) and the calling thread destroying the local response shared_ptr when the function returned. Give the transport its own MutuallyExclusive callback group, registered on a private SingleThreadedExecutor, and drive each RPC via executor_.spin_until_future_complete(). The client callback execution and the response destruction now happen sequentially on the caller's thread, which removes the cross-thread race entirely. The seven per-method mutexes collapse into one executor mutex - the underlying SingleThreadedExecutor is not safe for concurrent spin_until_future_complete callers, so all RPCs serialise through it. The host node's executor no longer needs to know about these clients. The new callback group is created once at construction (single-threaded) and stays on the transport's private executor, so the rcl hash-map race that the no-naked-subscriptions gate guards against does not apply. Add the file to ALLOWED_LEGACY_FILES with a note pointing back at this issue. Closes #399
Contributor
There was a problem hiding this comment.
Pull request overview
Refactors Ros2FaultServiceTransport to remove the dependency on an externally-spun node executor by driving fault-manager service clients via a transport-owned SingleThreadedExecutor and synchronous spin_until_future_complete(), addressing the TSan race described in issue #399.
Changes:
- Create all 7 fault-manager service clients in a dedicated
MutuallyExclusivecallback group that is not auto-attached to the host node executor. - Add a private
SingleThreadedExecutorinRos2FaultServiceTransportand serialize all RPCs through anexecutor_mutex_while synchronously spinning futures. - Allowlist the transport source file in
check_no_naked_subscriptions.shdue to itscreate_callback_group()usage (with issue reference).
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/ros2_medkit_gateway/src/ros2/transports/ros2_fault_service_transport.cpp | Moves fault-manager RPC execution to a private executor + synchronous spinning; collapses per-method locks into a single executor lock. |
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/ros2/transports/ros2_fault_service_transport.hpp | Updates transport documentation and adds members for callback group, private executor, and executor mutex. |
| scripts/check_no_naked_subscriptions.sh | Adds the transport file to the allowlist to account for intentional callback-group creation (issue #399). |
Build the seven fault service clients on a private rclcpp::Node owned by Ros2FaultServiceTransport instead of a dedicated callback group on the host node. The private node is driven by the transport's own SingleThreadedExecutor via spin_until_future_complete(), so the host gateway node's executor never processes these clients and the client callback / response destruction stay on the caller's thread. This avoids the create_client(name, rclcpp::QoS, group) overload, which does not exist on Humble (rclcpp there only ships the rmw_qos_profile_t form), and keeps the no-naked-subscriptions regression gate satisfied without an allowlist entry, since create_client and create_node are not gated APIs. All seven RPC methods now share one consistent shape: build the request unlocked, then under executor_mutex_ wait for the service, send, spin to completion and take the response; response translation happens unlocked. Adds a regression test that completes a get_snapshots RPC while the host node is never spun, proving the transport drives its private client itself.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Refactors
Ros2FaultServiceTransportto drive its rclcpp service clients witha private
SingleThreadedExecutorandspin_until_future_complete()insteadof blocking on a
std::future::wait_for()that requires some external threadto spin the host node.
The seven service clients are now created against a dedicated
MutuallyExclusivecallback group that is explicitly not auto-attached tothe host node's executor. The transport owns the executor that drives this
group, so each RPC's client callback runs synchronously on the caller's
thread inside
spin_until_future_complete. The pending-request cleanup andthe response destruction therefore happen sequentially on the same thread,
which eliminates the TSan-reported race between the gateway's spin thread
calling
rclcpp::Executor::execute_clientand the caller destroying thelocal
responseshared_ptr at the end of the function.The seven per-method mutexes collapse into one
executor_mutex_-SingleThreadedExecutor::spin_until_future_complete()is not safe forconcurrent callers on the same executor, so all RPCs serialise through it.
This is the explicit tradeoff the issue accepts in exchange for losing the
external-spin-thread dependency.
Issue
Type
Testing
colcon build --packages-select ros2_medkit_gatewaycleancolcon test --packages-select ros2_medkit_gateway --ctest-args -LE "linter|integration":2044 tests / 0 failures
test_fault_manager(18 cases) passes - covers the previously racyget_snapshots/get_rosbagpaths and the namespaced variantsclang-tidy on changed files)
the CI sanitiser jobs will validate the race is gone
The callback group is created once at transport construction (single-threaded)
and stays on the transport's private executor. The
rclhash-map race thatcheck_no_naked_subscriptions.shguards against does not apply, so the fileis added to
ALLOWED_LEGACY_FILESwith a note referencing this issue.Checklist